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Abstract 


As far as we know, in the open scientific literature, there is no gen- 
eralized framework for the optimization of relational data warehouses 
which includes view and index selection and vertical view fragmen- 
tation. In this paper we are offering such a framework. We propose 
a formalized multidimensional model, based on relational schemas, 
which provides complete vertical view fragmentation and presents an 
approach of the transformation of a fragmented snowflake schema to 
a defragmented star schema through the process of denormalization. 

We define the generalized system of relational data warehouses op- 
timization by including vertical fragmentation of the implementation 
schema (F'), indexes (I) and view selection (S) for materialization. We 
consider Genetic Algorithm as an optimization method and introduce 
the technique of ”recessive bits” for handling the infeasible solutions 
that are obtained by a Genetic Algorithm. We also present two novel 
hybrid algorithms, i.e. they are combination of Greedy and Genetic 
Algorithms. 

Finally, we present our experimental results and show improve- 
ments of the performance and benefits of the generalized approach 
(SFI) and show that our novel algorithms significantly improve the 
efficiency of the optimization process for different input parameters. 
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1 Introduction 


This paper represents a collection and extension of four papers that we have 
recently published [19, 20, 21, 22]. 

The performance of the system of RDW (Relational Data Warehouses) 
depends on many factors which makes its optimization a very complex and 
challenging problem. The main elements of a system for RDW optimiza- 
tion are: definition of a solution space, objective function and choice of an 
optimization method. 

The solution space includes factors relevant for RDW performance as 
view and index selection and view fragmentation, i.e. partitioning. In some 
existing approaches the solution space for the problem of optimization of 
data warehouses is simplified to a great extent, and the selection of views 
or indexes is studied without considering other factors [1, 2, 6, 14, 15, 23}. 
These approaches are important for theoretical research, but are not appli- 
cable in practice. In some papers the view selection problem is generalized 
by including a proper set of indexes for each view and selection of views 
and indexes is done together [10, 20]. If the selection of views and indexes 
is performed separately and the set of indexes is added to the optimal set 
of views, then the common set might not have optimal performances. In [3] 
the problem of optimization of horizontal scheme partitioning was defined. 
The optimization problem of data warehouses as a combination of mate- 
rialized views, indexes and horizontal data partitioning was introduced in 
[4] and the approach of vertical fragmentation was introduced in [10]. But 
none of them include selections of views and indexes and complete vertical 
fragmentation. 

In this paper we present a solution of an optimization problem of RDW 
performance, based on a multi-dimensional model that includes complete 
vertical fragmentation. The model provides definition of all possible aggre- 
gate views and their data dependencies. It also includes all possible states 
of (de)normalization of the schema, from fully normalized snowflake to fully 
denormalized star schema. The solution space of the optimization problem 
includes aggregate views and all states of fragmentation (variants) through 
the process of denormalization of the views. Further, we define an indexing 
strategy for aggregate views, by including traditional B-tree indexes and 
advanced techniques that include bitmapped indexes. We name our opti- 
mization system - SFT since it includes selection of views (S), their vertical 
fragmentation (F) and their indexing (I) is introduced. 

The objective function evaluates the quality of a solution. In [1, 2, 9, 
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10, 23] a simplified objective function (linear cost model) was used. They 
suppose that the query processing cost is the number of rows that are pro- 
cessed. The linear cost model is not adequate for evaluating join queries 
with complex selection predicates. The optimization is usually considered 
with a certain constraints which divide all solutions (whole solution space) 
into two groups of solutions: feasible and infeasible. There are two types of 
constraints: real system and logical constraints. System constraints are well 
studied in existing research prototypes [1, 2, 9, 10, 20, 22, 23]. They can 
be disk space or maintenance cost constraints. In [1, 2, 10] the optimiza- 
tion system was used under disk space constraints and formal constraints 
were embedded in the penalty function. In [9, 23] the objective function 
involved query processing cost under views maintenance (refreshment) cost 
constraint. In [23] the constraints were incorporated into the algorithm 
through a stochastic ranking procedure. 

In this paper we use a non-linear cost model under the views of the 
maintenance cost constraint. The system was tested for complex workload, 
i.e. queries with projection, selection, join and grouping operations and with 
complex selection predicates. Also the logical constraints were considered 
and effects of solutions which violate logical constraints to the optimization 
process were analyzed. 

Some types of Genetic Algorithms - GA as optimization method were 
used in [3, 14, 16, 20, 23, 24], and greedy algorithm with its variants and 
some heuristic searching techniques were used in [2, 5, 8, 9, 13]. To de- 
termine the suitability of the genetic algorithm and constraint handling to 
the data warehouse optimization problem, we compare it to a widely used 
greedy algorithm. We observe that for a generalized solution space and for 
large optimization problems, the greedy algorithms presented in [13] have 
poor performances compared to the genetic algorithm presented in [23]. 

We also adapt SRGA (Stochastic Ranking evolutionary (Genetic) Algo- 
rithm) introduced in [23]. The algorithm was used in SFI system, i.e. it was 
applied to generalized solution space, based on our multi-dimensional model. 
Further, two novel hybrid algorithms named GGLA (Greedy-Genetic Linear 
Algorithm) and GGBA (Greedy-Genetic Binary Algorithm) are presented. 
Both of them are combination of Greedy and Genetic algorithms. The ge- 
netic parts of the algorithms are also based on SRGA. The algorithms were 
applied to the generalized solution space. 

The algorithms were applied to large optimization problems. For com- 
parison, in [23] the SRGA algorithm was successfully applied to the solution 
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space that consists of 16 to 256 views, while in our work the algorithms were 
successfully applied to the solution space (views, fragmented views and in- 
dexes) up to 1110 objects. Furthermore, the system was evaluated using 
non-linear objective function and tested for complex workload. To show the 
efficiency of our novel algorithms we have compared them with SRGA. We 
have made many experiments which verified dramatic improvements (up to 
280%) of the optimization process. The conclusion is that our optimization 
algorithms are much more effective and powerful then that developed in 
[23]. 

The paper is organized as follows: in Section 2 we introduce the defini- 
tion of the solution space, in Section 3 we consider the optimization prob- 
lem and present generalized genetic algorithm (SRGA) and novel hybrid 
algorithms (GGLA, GGBA) that successfully solves the aforementioned op- 
timization problem. In Section 4 we show an experimental evidence of the 
efficiency of our approach and our algorithms applied to generalized solu- 
tion space and finally, in Section 5 we give conclusions and post several open 
problems. 


2 Definition of a Solution Space 


In this section we define the generalized solution space of the optimization 
system. The solution space is based on a RDW schema which includes 
definition of dimension relations, all possible aggregate views, their different 
states of normalization, named variants of views and relational dimensions 
as well as all possible indexes. 


2.1 Definition of an Implementation Schema 


The warehouse data are multidimensional in nature, conceptually organized 
in a multidimensional data cube. The data are stored in specialized rela- 
tions (tables), called fact and dimension relations. In the most real-life im- 
plementations the dimensions consist of more than one attribute, organized 
in attribute hierarchy. According to the attribute hierarchy presentation 
there are two schemas of implementation: star and snowflake schema. 

In the star schema all attributes of each dimension are stored in one 
relation, i.e. the attribute hierarchy is presented implicitly. Data in the 
schema are denormalized which provides optimized complex aggregate query 
processing. 
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On the other hand, in the snowflake schema, attributes of each di- 
mension are normalized and stored in different relations, i.e. the attribute 
hierarchy is presented explicitly. The snowflake schema offers flexibility, 
however, it is often at the cost of performance since more joins for each 
query are required. 

In this subsection we define a multidimensional model which captures 
both schemas as extreme states and furthermore all intermediate state of 
(de)noramlization. 


Definition 1 The dimension D is defined as the pair D= (Rp, Hp) with: 


e Rp = {R,,..., Ry} #0 as the set of base (dimension) relations, where 
each dimension relation R; is characterized by the basic set of at- 
tributes BA; = PA; JDA;UF'A;, where PA; 4 0 is the set of primary 
key attributes (identifying attributes), DA; 4 Q is the set of descriptive 
attributes and FA; is the set of foreign key attributes. 


e Hp = {M,,...,Hm} as the set of dependencies between dimension re- 
lations named dimension hierarchy, where each H; is characterized 
by two dimension relations Rj, Rp, and it is presented by the pair 
H,(R;, Rp), and also PA; C FAp. 


We note that for each dependence H;(R;, Rp) the relation PA; C FA, is 
satisfied. Intuitively, the dimension D is relational schema in 3th normal 
form - NF and no functional dependencies exist between attributes of each 
DA;. However, for the dimension D the next restriction is introduced: 

The graph of dimension D, obtained by interpreting dimension rela- 
tions as nodes and dependencies between them as arcs, is a directed acyclic 
graph with the following elements: 


e Exactly one root node, i.e. dimension relation R,. , satisfies VH;(R;, Rp) € 
Hp, Ry # R;; 


e Non empty set of leaf nodes {R,|VH;(R;, Rp) € Hp, Ri Ry}; 


For better formal presentation the functions fR, fPA and fBA are 
defined as fR(PA) = R, fPA(R) = PA (fR = fPA'), fBA(R) = BA, 
where # is a dimension relation and PA, BA are its primary and basic set 
of attributes, respectively. 
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Definition 2 The extended set of attributes EA; of dimension relation 
R; € Rp of dimension D = (Rp, Hp) is defined by the following algo- 
rithm: 


Find_Set (R;) 
Begin 
EA; = fBA(Ri) 
For all Ry such that there exists Hi(Rz, Rj) € Hp Do 
EA; = EA;U FindSet (Rx) 
End_For 
Return EA; 
End Find_Set 


The additional set of attributes AA; of dimension relation R; is defined 
as AA; = EA;\BA;. Additional sets of attributes are important to realize 
our idea of denormalization of the schema of dimension relations. Actually, 
for each dimension relation, attributes of its additional set can be added to 
the basic set. Also, fEA function with fF A(R) = EA is defined, where R 
as a dimension relation and EA as its extended set of attributes. 


Definition 3 The data cube DC is defined as the pair DC = (DCp,M), 
where DCp = {Dj,...,Dn} is the set of dimensions and M is the set of 
measure attributes. 


Definition 4 The implementation schema SC of data cube DC is defined 
as the pair SC = (DC, AV), where AV = {Vj, ..., Vp} is the set of aggregate 
views, where each V, is characterized by: 


e set of measure attributes Mj C M, where M is the set of measure 
attributes of DC; 


e basic set of grouping attributes defined as BGA; = Up,epct, fPA(Ri), 
where DCL C DCp is a subset of dimensions of DC’ and Ri € Rp,; 


e if fPA(R:), fPA(Ri,) C BGA, and Ri, Ri, € Rp, > j = m,Vj,m= 
1,...,k;, where kj is the number of relations of D; dimension. 


Intuitively, aggregate views can contain different subsets of the set of mea- 
sure attributes (first item) which provide vertical fragmentation of measure 
attributes. Between the appropriate dimension relations and the aggregate 
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views there are 1:M relationships, which means that the primary key at- 
tributes of dimension relations at the same time are grouping and foreign 
key attributes in the corresponding aggregate view (second item). Group- 
ing ("group by”) attributes of any aggregate view consists of primary key 
attributes from at most one relation of each dimension (third item). 

The set of all dimension relations in the implementation schema SC is 
Rsc = Up,epcp Rp;. 

To ensure that all data of data cube DC will be stored in at least one 
view of implementation schema SC, it is necessary to define the view with 
highest granularity of data and with all measure attributes. 


Definition 5 The view V, is primary aggregate view of the implementation 
schema SC = (DC, AV) if it is characterized by Mp = M as the set of 
measure attributes and by BGA, = Up,epc, f PA(R) as the set of group- 
ing attributes, where DC'p is the set of dimensions of data cube DC and 
R* is root relation of D; dimension. All other aggregate views are named 
supporting views. 


In the next two definitions, similarly to the definitions for dimension 
relations, we define extended and additional set of attributes of aggregate 
views. 


Definition 6 The extended set of grouping attributes EGA; of aggregate 
view V; is defined as EGA; = Urpa(rycaca,f EA(R), where BGA; is the 
basic set of grouping attributes of V;. 


The additional set of grouping attributes AGA; of aggregate view V; is 
defined as AGA; = EGA;\BGA;,. 

Also, the definition of data dependencies between aggregate views of a 
data cube implementation schema is introduced: 


Definition 7 The aggregate view V; depends on the aggregate view V; in the 
implementation schema SC = (DC, AV) if: 1. M; GC Mj; and 2. BGA; © 
EGA;. 


Intuitively, V; can be created using tuples of V;. The data of V; are at 
a lower level of granularity than data of V;. Note that using our definition 
of aggregate views and the second item of the definition of view dependency 
one can create structural lattice (directed graph) in a similar way as the 
lattice introduced in [8]. 
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Lemma 1 [f the aggregate view V; depends on the aggregate view V; in the 
implementation schema SC = (DC, AV), then EGA; C EGA;. 


Proof: Let assume that 4D, € DC with the extended set of attributes 
EA, and SBA; C EA, and BA; C EGA;\EGA,;. Let R* be the root 
relation of D;,, then from Definition 1 and 4, BA, C BG‘A;, where BA, = 
fBA(R*) and BG4A,; is the basic set of attributes of Vj. Then BA, C EGA; 
and from construction of EGA; (Definition 2 and 6) = BA; C EGA;, which 
is a contradiction. 


2.2 Vertical Fragmentation of Views 


Our idea is to start from the fully normalized data cube schema and con- 
sider all possible states of denormalization. We have used a theorem which 
formally proves that the dimension relation of a star schema can be consid- 
ered as a view defined on the respective relations of the snowflake schema 
(see [18]). There are two possible levels of denormalization. The first one is 
the denormalization of dimensional relations. For each dimension relation, 
it is possible to add any subset of its set of additional attributes to its basic 
set of attributes. 


Definition 8 The dimension relation R' derived from the dimension rela- 
tion R by adding (possible empty) subset AA’ C AA to BA is a variant 
of original relation R, where R is characterized by BA as the basic set of 
attributes and by AA as the additional set of attributes. 


The relation variant R’ is characterized by its basic set BA’ = BAUAA’ and 
note that BA C BA’ C EA. The second level is the denormalization of the 
aggregate views. In a similar way, for each aggregate view, it is possible to 
add any subset of its set of additional attributes to its basic set of grouping 
attributes. 


Definition 9 The aggregate view V* derived from the aggregate view V 
by adding (possible empty) subset AGA’ C AGA to BGA is a variant of 
the original view V, where V is characterized by BGA as the basic set of 
grouping attributes and by AGA as the additional set of grouping attributes. 


The aggregate view variant V* is characterized by its basic set BGA’ = 
BGAU AGA’ and note that BGA C BGA’ C EGA. 
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Theorem 1 [f the aggregate view V;, depends on the aggregate view V, then 
any variant of V;, depends on any variant of Vp. 


Proof: All variants of the same aggregate view have the same set of measure 
attributes, so it is necessary to prove only the second condition of Definition 
7. From the construction of extended set of attributes (Definition 2 and 6) 
and from Definition 9 of view variants follows that extended set of attributes 
is the same for all variants of the same view. So BGA; C BGAi C EGA, 
and BGA; C EGA,, then from Lemma 1, follows BGA, C EGAjp. 


The previous theorem assures that data dependencies between view 
variants are the same as view dependencies defined in Definition 7. 

We note that the number of variants of aggregate views depends ex- 
ponentially on the number of elements of their additional set of grouping 
attributes. But, to simplify the problem, i.e. to decrease the number of 
variants, according to a priori defined heuristic, it is possible to group ad- 
ditional attributes of the same set in subsets of attributes. In this case the 
number of variants will depend exponentially on the number of new subsets. 


2.3. Indexes 


The next step in definition of solution space is determining the indexing 
strategy, which is very important since indexes can improve query execution 
time substantially. 


Definition 10 The index I on the aggregate view V; in the implementation 
schema SC = (DC, AV) is defined as the tuple I = (Vj, TAr,1A%), where 
V; € AV, IA; C EGA;, IA; # O is the set of indexed attributes and TA* 
is the ordered sequence of all elements of IAy. 


We note that EGA; is the extended set of grouping attributes of the ag- 
gregate view V;. Intuitively, there can be several B-tree indexes on a given 
view, i.e. one index for every subset of the attributes of an extended set 
and every ordering of the subset. 


Definition 11 The index I is applicable on the variant V; of aggregate 
view V; if TA; © BGA, where IA; 4 @ is the set of indexed attributes 
and BGA; C EGA; is the basic set of attributes of Ue and EGA; is the 
extended set of grouping attributes V;. 
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The index J is named as fat index on V; if IA; = BGA‘ and index I is 
named as slim (one-attribute) index if |JA;| = 1, ie. cardinality of IA; is 
1. The set of indexes in implementation schema SC is SJ. 

According to the theory of indexing, the novel bitmapped indexes are 
very suitable for processing of data cube based queries. A bitmap index 
is usually defined for one attribute. However it is easy to process complex 
conditions involving more than one attribute. Thus, in experiments of this 
work one-attribute bitmap indexes were considered. Note that the number 
of one-attribute bitmap indexes of aggregate view V; is equal to the number 
of elements of its extended set of grouping attributes EG'A;. In rest of this 
paper terms (one-attribute) bitmap index and index will be used simulta- 
neously. The sequence of all indexes of aggregate view Vj; is disposed in 
advance. 


Definition 12 The view length Ly, of aggregate view Vj is defined as Ly, = 
1+k+m+n, where k is the number of elements of the additional set of 
grouping attributes of the view AGA;, m is the number of elements of the 
set of measure attributes and n is the number of bitmap indexes of the view, 
i.e. number of elements of the extended set of grouping attributes EGA;. 


The parameters Ly, are necessary to calculate the number of bits 
(genes) for representation of the solutions in our algorithms. 


2.4 Workloads 


A DW responds to a large number of data cube based ad-hoc queries, i.e. to 
dynamic and in principle unpredictable queries. However, lot of them can 
be determined a priori and can be formalized. In [17] processing of grouping 
queries on DW was researched, while in [8] all possible slice queries based 
on data cube views were considered. Selection (” where clause”) attributes 
are disjoint from grouping attributes. That means they first select data by 
a certain set of attributes and after that they group them by another set of 
attributes. But in [8] different data operators, i.e. predicates in selection 
expressions are not considered. The predicates in selection expressions are 
also important for query processing time, i.e. different predicates returns 
different numbers of tuples. Therefore, in this paper, in our query definition, 
a set of different predicates is included. 


Definition 13 The query Q in the implementation schema SC = (DC, AV) 
is defined as the tuple Q = (Mg, GAg, Pg, Fg), where Mg C M is the set 
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of measure attributes, GAg GC EGAp, is the set of grouping attributes, Po 
as the selection expression and Fg is the expected frequency of the query. 


The selection expression Pg of the query Q@ is in the form: 
piANDp2AND...ANDpm, where p; : AjOValue;, 9 € {=,<,>,<,>} is 
a data operator, Value; € Dom(A;). The set SAg defined as SAg = 
{A1,..., Am} is the set of selection attributes and SAg C EGA, (EG‘Ay is 
the extended set of attributes of the primary view). The data operators are 
divided in two groups: equality and inequality data operators. Inequality 
operators usually return result which contains more then one tuple, so af- 
ter selection by certain set of attributes it is reasonable to group resulting 
data by the same set of attributes. By this reason, sets of grouping and 
selection attributes are not disjoint. The set of all possible queries in the 
implementation schema SC is SQ. 


Definition 14 The ratio of equality operator ER in the selection expression 
of query Q is defined as k/n, where n > 0 is the number of data operators, 
i.e. the number of elements of the set of selection attributes SAg and k is 
the number of equality operators. 


The ratio of equality operator is necessary to estimate the number of tuples 
returned in each step of query execution, i.e. to calculate query execution 
cost. This is important to develop a realistic query evaluator. 


Definition 15 The query Q is computable from (can be answered by) the 
aggregate view V in the implementation schema SC = (DC,AV) if: 1. 
Mg © My and 2. GAgUSAg € EGAy, where My, EGAy is the set of 
measures and the extended set of grouping attributes of V, respectively. 


The previous definition is necessary to determine the set of views from which 
each query can be answered. The next step is to determine such view from 
which query is answered by the lowest execution cost. 


Theorem 2 I/f the query Q is computable from the aggregate view V then 
Q is computable from any variant of V. 


Proof: The first condition of Definition 15 is satisfied because all variants 
of the same aggregate view have the same set of measure attributes and 
proof of the second condition directly follows from Definition 9 of variants 
of views. 
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The important issue is how to select such a variant of each dimensional 
relation, such a set of views for materialization (each view presented with 
its proper variant) in order to minimize the total query processing time of 
queries with a certain constraint. 


3 Definition of an Optimization System 


3.1 A Practical Example 


For better representation of our formal model the practical example of sale 
database system is considered. The three dimensional data cube SALE 
with SALEp = {I,5,D} and M = {S_quantity, S_amount, S_price} is 
considered. The dimension relations, organized in dimension hierarchies, are 
shown in Figure 1. Simplified schema of the data cube, with all dimension 
relations and only with two views (the primary view and one supporting 
view of the schema) formally is described by: 


Item (I_id, It_id, I_name); 

Item_type(It_id, It_name) 

Supplier (S_id, C_id, S_name) ; 

City (C_id, Co_id, C_name); 

Country (Co_id, Co_name) ; 

Date (D_id, W_id, M_id, D_date); 

Week (W_id, D_week); 

Month (M_id, Y_id, D_month) ; 

Year (Y_id, D_year); 

Sale_ISD (I_id, S_id, D_id, S_quantity, S_amount, S_price); 
Sale_ItS (It_id, S_id, S_quantity, S_price); 


and graphically is shown in Figure 2. 

Set of relations of dimension IJ is R; = {Item, Item_type}. Relation 
Item is characterized by PA={ Lid}, DA={I_name}, FA={It_id}, BA={ Lid, 
Lname, It_id}, EA={Lid, Lname, It_id, It.name} and AA={It_name}. 
Number of variants of Jtem relations is 2! = 2. Dimension hierarchy of 
dimension I is defined as Hy = {It — I}, where It —I : Item_type — 
Item, ie. It — I(Item_type, Item). According to the dimension hierar- 
chies, the number of aggregate views of schema of the data cube SALE 
is 60. Primary aggregate view is Sale_ISD and is characterized by set 
of measure attributes Mgaicrsp = {S_quantity, S_amount, S_price} and 
by BGAgatersp = {I-id, Sid, Did} as basic set of grouping attributes. 
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tem Supplier Date Year Country = fe—>>} City ltem_type 


ltem_type City Month Week Month j¢—>>} Date Supplier ltem 
; f F a F = 7 
“as i i 
Country Year M:1 relationship Week Sales-ISD }«e—! | Sales-its }«——! 


Figure 1: Dimension hierarchies for Figure 2: Simplified schema of data 
I, S and D dimensions cube SALE 


Extended set of grouping attributes is EGAgaie_rsp={Lid, name, It_id, 
It_name, S_id, S_name, Cid, C_name, Co_id, Co_name, D_id, D_date, W_id, 
D_week, M_id, D-month, Y_id, D_year}, i.e. all attributes. Number of vari- 
ants of aggregate view Sale_ISD is 2!° = 32768. Supporting aggregate 
view Sale_ItS is characterized by set of measure attributes Msaie ns = 
{S_quantity, S_price} and by BGAgaie_ng = {It-id, S_id} as basic set of 
grouping attributes. Extended set of grouping attributes is EGAggie_ng= 
{It_id, It_name, S_id, S_name, Cid, C_name, Co_id, Co_name}. Number 
of variants of Sale_ItS is 2° = 64. 
Note that Sale_ItS is computable from Sale_ISD. Actually, if Sale_IS'D 

is given by its zero variant Sale_ISD°, this means that the view Sale_ItS 
can be created by the following SQL statement: 


CREATE MATERIALIZED VIEW Sale_ItS AS 

SELECT I.It_id, F.S_id, SUM(F.S_quantity) S_quantity, 
AVG(F.S_price) S_price 

FROM Sales_ISD F, Item I 

WHERE I.I_id=F.I_id 

GROUP BY I.It_id, F.S_id; 


To create (compute) Sale_ItS it is necessary to join aggregate view Sale ISD 
and dimension relation Item. But if Sale_ISD is given by its variant 
Sale ISD! characterized by BEAL. re = {lid, Itid, It name, S_id, 
S_name, D_id} then the view Sale_ItS can be created by the following 
SQL statement: 


CREATE MATERIALIZED VIEW Sale_ItS AS 

SELECT F.It_id, F.S_id, SUM(F.S_quantity) S_quantity, 
AVG(F.S_price) S_price 

FROM Sales_ISD F 
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GROUP BY F.It_id, F.S_id; 


Note that maintenance cost of Sale_ISD is increased by adding the new 
attributes, but at the same time maintenance cost of S'ale_ItS is decreased. 
Also note that view Sale_ItS' is presented by its zero variant BG Agate ng = 
BGA ute ng = {It-id, Sid}. 

If we assume that the view Sales_ItS' is presented by its zero variant, i.e. 
materialized by above SQL statement, then to process the following query, 
named QSale (sales of type items ” Milk Products” by supplier names), it is 
necessary to join proper dimension relations to the materialized view: 


SELECT F.It_id, F.S_id, S.S_name, SUM(F.S_quantity) S_quantity 
FROM Sales_ItS F, Item_type It, Supplier S$ 
WHERE It.It_name="Milk Products" AND F.It_id=It.It_id 
AND F.S_id=S.S_id 
GROUP BY F.It_id, F.S_id, S.S_name; 


Formally, the query QSale can be described by following sets: Mgsale = 
{S_quantity} as set of measure attributes, GAgsale = {[t-id, S_id, S_name}, 
Posate : Itmname = ” MilkProducts” as selection expression and set of 
selection attributes SAgsaie = {Itmname}. Expected query frequency id 
FQsale = 1. To reduce the processing time required for joining views and di- 
mension relations, frequently accessed attributes can be stored into materi- 
alized views. Thus, another possible way to create view Sales_ItS, from vari- 
ant of Sales_ISD with BGA2,), ;¢p = {1-id, It id, It_name, Sid, S_name, 
D_id}), i.e. another variant of Sales_ItS is: 


CREATE MATERIALIZED VIEW Sale_ItS AS 

SELECT F.It_id, F.It_name, F.S_id, F.S_name, 
SUM(F.S_quantity) S_quantity, AVG(F.S_price) S_price 

FROM Sales_ISD F 

GROUP BY I.It_id, It.It_name, F.S_id, 5.S_name; 


The second variant Sales_ItS! of the view Sales_ItS is characterized by the 
set BGA). 15 = {It-id, It nameS_id, S_name} of grouping attributes. It 
is easy to note that the number of tuples is the same as the first variant of the 
view. To process previous query now it is not necessary to join dimension 
relations to fact relation and the SQL statement for the query is: 


SELECT F.It_id, F.It_name, F.S_id, F.S_name, F.S_quantity 
FROM Sales_ItS F 
WHERE F.I_tname="Milk Products"; 
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The advantage of using the second variant is decreasing the processing 
time of the queries, but the disadvantage is increasing the storage space, 
i.e. maintenance cost. In [5] some frequently accessed dimension keys and 
attributes are stored in various materialized views. However, a serious prob- 
lem is to consider set of all possible variants of the data cube views and to 
find the optimal one with largest benefit of query processing and minimal 
maintenance cost. 


3.2. An Objective Function 


In this section we will propose a suitable evaluation function of the opti- 
mization process. Let SCyy be the state of the data cube schema SC’ with 
the set AVjy C AV of candidate views for materialization where each of 
them is presented by exactly one of its variants and the set Sly, C SI 
of its candidate indexes. Let also all dimension relations be presented by 
their appropriate variants. Then maintenance-cost constrained optimiza- 
tion problem is the following one: Select a state SCyy of data cube schema 
SC that minimizes 


T(SC, SCm, SQ) = ~Q«sQ Fe * P(Q, SC), 


under the constraint U(SC,SCy) < S, where SQ is the set of predefined 
queries, Fg is query frequency and P(Q,SCj,) denotes the minimum pro- 
cessing cost of the query Q in the SCyy state of implementation schema 
SC. 

Let U(SC,SCy,) be total maintenance cost defined as: 


U(SC, SC) = direrso GR* MR, SCu) + Viveavy, Gv * MV, SCu) + 
resig Gr ok m(IL, SCm), 


where Gr, Gy and G_, is update frequency of relations, views and indexes, 
respectively. Let m(R, SCyz), m(V, SCs) and m(I, SCyz) be the minimum 
cost of maintaining relations, views and indexes, respectively in presence of 
state SCyy. 

We note that P(Q,SCj,) is objective function of the problem. To cal- 
culate values of the functions P(Q,SCyr), m(R,SCyz), m(V,SCy) and 
m(I,SCy,) we developed algorithms based on common query execution 
(processing) theory, presented in [7] and also on some ideas from [1, 2, 
11, 14], as well as [17]. 
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The common method for dealing with constrained optimization prob- 
lems is to introduce a penalty function to the objective function. We de- 
fined the penalty function as ¢(SC,SCy) = Max{U(SC,SCy) — S,0}. 
The function is used in the Algorithm 2 for comparing pairs of two adjacent 
chromosomes. 


3.3 Handling of Logical Constraints 


As we said, each view is presented by exactly one bit, followed by bits used 
to present its additional attributes (view variants), its measure attributes 
and its indexes. If a view is not selected for materialization, all its bits 
which present view variants are irrelevant for evaluation of the current so- 
lution, i.e. adding additional attributes to the view is logically infeasible. 
We have named those bits as recessive bits. In this case the bits which 
present measure attributes and indexes of the view are also recessive bits. 
Thus, there are three types of recessive bits: for representations of view 
variants, for the measures attributes and for the indexes, named as variant, 
measure and index bits, respectively. Note that the number of recessive bits 
is very large in regard of other bits. They are irrelevant for evaluation of 
the current solution, but they are important for optimization process in the 
next generations, i.e. they are important for GA operations: crossover and 
mutation. This means that recessive bits comprise large genetic material 
and by their handling we can change the direction and we can improve the 
performances of the optimization process. 

For example, if the view is not chosen for materialization in the first 
generation and all recessive bits have value 0, then if in the second genera- 
tion a bit that presents a view mutate to 1, then that view will be chosen 
for materialization and it will be presented by a basic variant, without ad- 
ditional attributes and without indexes. For the opposite situation, if all 
recessive bits have value 1, then after a mutation, the view will be presented 
by variant with all additional attributes and with all indexes. In this paper 
we examine three different strategies for recessive bits handling. By the first 
strategy (RBR) recessive bits are generated randomly and can be changed 
by operation of mutation. By next two strategies all recessive bits have 
fixed value 1 (RB1) and 0 (RBO), respectively and can not be changed by 
operation of mutation. Effects of each strategy to the optimization process, 
according to the different parameters, are shown in next section. 
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3.4 The Generalized Genetic Algorithm 


In this paper SRGA (Stochastic Ranking Genetic Algorithm) is adapted and 
applied to the solution space generalized by including view fragmentation 
and index selection, so also we named the algorithm as GGA (Generalized 
Genetic Algorithm). 

First, we present the solution space for SRGA by an array of bits, i.e. by 
a chromosome. We start with the representation of dimension relations (as 
special types of views for materialization), including their different variants. 
The number of bits that are needed to present each dimension relation 
with all its variants is loggn +1, where the n is the number of all possible 
variants and logy n is equal to the number of elements of an additional set 
of attributes. Thus, the first bit is used to present a dimension relation 
and other n bits to present additional attributes. As we said, all dimension 
relations must be materialized, thus each of them has a value of 1. If an 
attribute of the additional set is to be added to its dimension relation, it 
has a value of 1 and if it is not added it has a value of 0. Aggregate 
views are presented in similar way, i.e. for each view, one bit is used for 
its representation (by 1 if it is selected or by 0 if it is not selected for 
materialization) and n bits are used to present its additional attributes. 
The attributes of additional sets of aggregate views are presented in the 
same way as attributes of additional sets of dimension relations. Finally, 
for each aggregate view, the measure attributes are presented. A measure 
attribute, if it is added to the appropriate view has a value of 1 and if it 
is not added, then it has a value of 0. Note that for each view at least 
one measure attribute must be added. The representation of each view is 
followed by representation of its possible indexes. Each index is presented 
exactly by one bit, i.e. by 1 if it is selected or by 0 if it is not selected. The 
number and sequence of all indexes are disposed in advance. 

For example, the view length of aggregate view Sales_ItS is Lgate_ 1s = 
14+6+2+4+8=17. If Sales_ItS is presented by the 10101001010101000, 
then the first bit is used for representation of the view - it is selected for 
materialization. The next 6 bits are used for representation of the addi- 
tional attributes - the view is presented by its variant with 2 additional 
attributes (S_name, C_name). The next 2 bits are used for representation 
of the measure attributes - the view has 1 measure attribute (S_quantity) 
The last 8 bits are used for representation of the indexes - 3 indexes for 
Itid, S_id and Cid attributes are selected for materialization. If the view 
is not materialized, all its variant, measure and index bits are irrelevant for 
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evaluation of the current solution. 

Input parameters of the SRGA algorithm are: NC - Number of Chro- 
mosomes (population size), NG - Number of Generations, LC - Length of 
Chromosome (number of bits needed to present whole solution space), PC 
- Probability of mutation of Chromosomes, PG - Probability of mutation of 
Genes, Rgc - set of dimension Relations, AV - set of Aggregate Views (in- 
cluding dimension relations as special types of views), SJ - Set of Indexes, 
SQ is Set of Queries and PF is a Probability for rank-based selection, in- 
troduced in [23]. POP(i) presents i*” generation of the population. 

The initial population is randomly generated by the procedure C’e- 
ate_population according to the rules presented above. Each of its chro- 
mosomes is evaluated for a predefined workload, i.e. set of queries SQ by 
the procedure Evaluate_population. We note that Evaluate_population is 
implementation of the function 7(SC,SCy,SQ) and by POP(t) are pre- 
sented NC particular states of implementation schema SC’, each chromo- 
some of the population presents a different state. The procedure Evalu- 
ate_population_materialization evaluates maintenance solution cost for each 
chromosome of the population POP(i, LC) - it is implementation of the 
function U(SC, SCj,). 


Algorithm 1: SRGA(NC, NG, LC, PC, PG, AV, ST, SQ, PF) 
Begin 
Create_population(POP(1), NC, LC, AV, SI); 
Evaluate_population(POP(1), SQ); 
Evaluate _population_materialization(POP(1)); 
Fori=2 to NG Do 
Perform_crossover(POP(i — 1), NC); 
Perform_mutation(POP(i), PC, PG, NC); 
Evaluate_population(POP(i), SQ); 
Evaluate_population_materialization(POP(t)); 
POP(t) := Merge,opulations(POP(i — 1), POP(t)); 
Perform_selection(POP(i), NC, SQ, PF); 
End For; 
End GGLA; 


In order to obtain a population with better characteristics, the proce- 
dures Perform_crossover, Perform_mutation and Perform_selection perform 
GA operations - crossover, mutation and selection, respectively. We note 
that by applying the Perform_crossover(POP(i— 1), NC) procedure to the 
(¢ — 1)-th generation of population, NC new chromosomes of i-th genera- 
tion of the population are generated. We used a special case of multi-point 
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crossover operation where the number of crossover points is n — 1, where 
n is the number of genes, i.e. blocks with length of 1 gene. This crossover 
is known as uniform crossover. The procedures Perform_mutation, Eval- 
uate_population and Evaluate_population_materialization are applied to the 
new i-th generation. Probability to choose a chromosome for mutation is 
given by the parameter PC, and probability to mutate genes inside of a 
chosen chromosome is given by the parameter PG. Formally, the procedure 
Merge_populations adds chromosomes from previous (i — 1)-th to the newest 
i-th generation. 

The key difference from most GAs is the Perform_selection procedure, 
based on the stochastic ranking algorithm presented in [23]. The algorithm 
is presented in the next subsection. It is similar to bubble-sort algorithm 
used for ranking the union of chromosomes of last two generations. But in 
[23] the algorithm is used just for selection of views and it is applied to small 
solution space. In this paper the algorithm is applied to the generalized 
solution space which includes selection of views and indexes and selection 
of view fragments. Also, the algorithm is used to find a near optimal model 
of data cubes consist of a large number of objects. 


3.5 The Stochastic Ranking Procedure 


Finding a penalty coefficient optimal value is difficult and the penalty meth- 
ods setting a static or dynamic penalty coefficient value do not work well for 
the constrained optimization problems. In [23] is presented a new constraint 
handling technique, named stochastic ranking, to balance the dominance of 
the objective and penalty functions for the view selection problem. The 
novel idea of this technique is the introduction of a probability Py for rank- 
based selection. During the course of ranking, it is necessary to compare 
pairs of two adjacent individuals. If they are both feasible solutions, natu- 
rally, they will be compared according to the objective function. However, 
when either of them is infeasible, the probability of comparing them accord- 
ing to the objective function is Py, while the probability of comparing them 
according to the penalty function will be 1 — Pr. Since is a probability, it 
gives an opportunity for both the objective and penalty functions to rank a 
pair. When Py > 1/2, the ranking is biased toward the objective function. 
When Py < 1/2, the ranking is biased toward the penalty function. So 
Py can balance the objective and penalty functions more directly, explicitly 
and conveniently. Moreover, we do not have any extra computing cost for 
setting penalty coefficient values since we do not use any penalty terms. 


43 


In this paper the stochastic ranking algorithm is implemented by the 
Perform_selection procedure (see Algorithm 2). The probability Py is pre- 
sented by the parameter PF’. In our experiments we set the parameter value 
dynamically during the optimization process. For example, we usually set 
PF < 1/2 to reduce the ratio of infeasible solutions to the whole in the 
several final generations. 

Finally, we note that the constraints of the optimization process are 
incorporated into the algorithm through a stochastic ranking procedure. 

The input population PO P(i) consists of 2* NC chromosomes as union 
set of chromosomes from two generations. The procedure Delete_Chromosome 
eliminates NC worse chromosomes from the population. To select the chro- 
mosomes for elimination we use the negative (elimination) selection method 
presented in [12]. The reasons for that are: the population consists of 2* NC 
chromosomes, so in this case elimination selection is faster than generation 
selection; elitism is inherently involved in elimination selection. 


Algorithm 2: Perform_Selection(POP(i), NC, SQ, PF) 


Begin 
Fork =1 to NC Do 
Swap:=False; 


Forj =1 to NC «2-1 Do 
If (6(POP(i, j)) = (POPC, j + 1) =0) or 
(Random(0,1) > PF’) Then 
If T(POP(i, 7), SQ) > 7(POP(i,j +1),SQ) Then 
Swap_Chromosomes(POP(i,j), POP(i, j + 1)); 
Swap:=True; 
End If 
Else 
If é(POP(i,9)) > d(POP(i,j +1)) Then 
Swap_Chromosomes(POP(i, 7), POP(i, 7 + 1)); 
Swap:=True; 
End If; 
End If; 
End For; 
If Swap=False Then 
Exit For; 
End If, 
End For; 
Delete._Chromosome(POP(i), NC); 
End Perform_Selection,;; 
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3.6 The Novel Greedy-Genetic Algorithms 


In this section we will present two novel hybrid algorithms for optimization 
of RDW. Our algorithms are hybrid because they are combination of greedy 
and genetic algorithms. 

The Algorithm 3 is named GGLA - Greedy-Genetic Linear Algorithm. 
The parameter NF represents Number of Fragments of the solution space, 
which is equal to the number of steps of greedy procedure. All other input 
parameters of the algorithm are the same as in Algorithm 1. 

Parameters of the Algorithm 4 are: NL - New Length of chromosome 
(input), CL - Current Length of the chromosome (input/output), AVg - set 
of chosen views (input/output), AVc - set of chosen views in current step 
(output), AV - set of Aggregate Views (input). 


Algorithm 3: GGLA(NC, NG, LC, PC, PG, NF, AV, SQ) 
Begin 
AVz := 0; CL :=0; 
Choose_views(Round(LC/NF), CL, AVp, AVc, AV); 
Extend_chromosome(POP(1), AVc, NC); 
Evaluate_population(POP(1), SQ); 
Evaluate_population_materialization(POP(1)); 
j= 2; 
Forti =2 to NG Do 
If Mod(i, Round(NG/NF)) =1 Then 
Choose_views(Round(LC « j/NF'),CL, AVg, AVc, AV); 
Extend_chromosome(POP(i), AVg, NC); 
j= jt; 
End If; 
Perform_crossover(POP(i — 1), NC); 
Perform_mutation(POP(i), PC, PG, NC); 
Evaluate_population(POP(i), SQ); 
Evaluate_population_materialization(POP(1)); 
Perform_selection(POP(i), NC); 
End For, 
End GGLA; 


The GGLA algorithm is graphically shown in Figure 3. Before the 
optimization process starts, we order all aggregate views by their ratio of 
usability (UR), i.e their importance. The general idea is in certain steps 
(NF) by greedy procedure to choose the subsets of views with all their bits 
(Algorithm 4, i.e. procedure Choose_views) and to add them to already 
selected ones, i.e. to concatenate their randomly generated bits to the 
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already created chromosomes (Extend_chromosome procedure). By those 
procedures we roughly modulate the solution space. 

After each step of greedy procedure, we perform fine optimization by 
using GA (few generations on current solution space). The procedure Eval- 
uate_population evaluates the quality of solutions and the procedure Evalu- 
ate_population_materialization evaluates the maintenance solution cost. The 
procedures Perform_crossover, Perform_mutation and Perform_selection per- 
form GA operations - crossover, mutation and selection, respectively. 


Agorithm 4: Choose_views(NL,CL, AVp, AVo, AV) 
Begin 
AVo = 0; 
While NE > CL Do 
AVp := AVp UV;, where V; has maximal UR; in AV\AVz; 
AVo := AVo UV;; 
CL = CL + Ly, 
End While; 
End Choose_views; 


The next algorithm is named GGBA - Greedy-Genetic Binary Algo- 
rithm. All input parameters are the same as in Algorithm 3. 

The GGBA algorithm graphically is shown in Figure 4. POP(i, AV;) 
presents the i’” generation of the population and consists of a fragment of 
chromosomes represented by AV; subset of views. POP2(i) presents i!” 
generation of the population and consists of whole chromosomes (created 
by concatenation of all fragments of POP population), represented by set 
of all aggregate views AV. The population POP2 is necessary to evaluate 
the maintenance solution cost. The variable N'S' is the number of steps of 
the greedy procedure. 


46 


Algorithm 5: GGBA(NC, NG, LC, PC, PG, NF, AV, SQ) 
Begin 

Divide_views(N F, AV); 
Fork =1 to NF Do 

Create_population(POP(1, AV;,), NC); 

Evaluate_population(POP(1, AV), SQ); 

Concat_whole_chr(POP(1, AV;,), POP2(1)); 
End For; 
Evaluate_population_materialization( PO P2(1)); 
NS := loggNF +1; 
Fori=2 to NG Do 

If Mod(i, Round(NG/NS)) =1 Then 

Fork =1 to NF/2 Do 
Concat_chr(POP(i, AVe), AVaxk—1, AV2«k); 


End For; 
NF := NF/2; 
End If; 


Fork =1 to NF Do 
Perform_crossover(POP(i — 1, AV), NC); 
Perform_mutation(POP(i, AV;), PC, PG, NC); 
Evaluate_population((POP(i, AVi), SQ); 
Concat_whole_chr(POP(i, AV;,), POP2(i)); 

End For, 

Evaluate_population_materialization( PO P2(t)); 

Fork =1 to NF Do 
Perform_selection((POP(i, AV;), NC); 

End For, 

End For; 
End GGBA; 


In the procedure Concat_chr we define a new set of aggregated views 
AVx := AVoxp—1 UU AVoxk and new population POP(i, AV;,) which consists of 
chromosome fragments created by concatenation of chromosome fragments 
of POP(i, AV2.4-1) and POP(i, AV2.~) populations. The chromosome frag- 
ments of both populations are ordered from the best to the worst evaluated 
and concatenated fragments at the same position. We named this concate- 
nation strategy best-to-best. In similar way, the procedure Concat_whole_chr 
creates POP2(i) as concatenation from populations of all subsets of views 
POP(i, AV;,).. Here we also use best-to-best concatenation strategy. All 
procedures with the exception of Evaluate_population_materialization can 
be parallelized for different fragments of chromosomes, i.e. subsets AVz of 
AV which gives the total improvement of the performances of the algorithm. 
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4 Experimental Results 


In this section we present our experimental system and some of the experi- 
mental results obtained by it. For the efficiency of the optimization process 
using GA several input parameters are important. In our previous work we 
have already described some experiments with a wide range of their val- 
ues. In this paper we fixed the parameters to the following values: Number 
of Chromosomes (Population Size)- NC = 20, Probability of Chromosome 
mutation - PC’ = 0.5 and Gene mutation - PG = 0.05, so the overall prob- 
ability of mutation is 2.5%. 

To determine the suitability of the GA and constraint handling to the 
RDW optimization problem, we compare it to a widely used greedy algo- 
rithm (see Figure 5). The SRGA was compared to the greedy algorithm 
presented in [13] - GG (Greedy by Gupta). On the x-axis, the optimization 
process time is shown, even on the y-axis, the benefit of query process- 
ing cost (value of the evaluation function) is shown. We observe that for 
the generalized solution space and for large optimization problems, greedy 
algorithms have poor performances compared to the SRGA. 

In order to show the efficiency of vertical view fragmentation of the 
approach of view selection with vertical view fragmentation and indexes 
(SFI), we compared it to the approaches without fragmentation (SI), with- 
out indexes (SF) and without fragmentation and indexes (S). Exactly 1280 
experiments were performed with SRGA. For all experiments the termina- 
tion condition of optimization process was 50 generations. 

A comparison of the benefit of processing cost of all approaches, on 
different levels of maintenance cost constraints, is shown in Figure 6. The 
optimization process is considered with four different values of maintenance 
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cost from 25% to 100% of the maximal maintenance cost. The next impor- 
tant parameter of the optimization process is the solution space size, i.e. 
chromosome length. Comparison of different approaches according to the 
solution space size is shown in Figure 7. On y-axis the benefit of query 
processing cost relatively to the best approach is shown. Evidently, in all 
cases SFI approach has the highest benefit of processing cost. 

In our experiments we considered four different parameters in query 
definition: number of grouping (projection) attributes, number of selection 
attributes, ratio of the equality operators and number of measure attributes. 
For all parameters the SFI approach has the highest benefit of processing 
cost. The average benefit of processing cost of queries with 3 attributes in 
the selection expression for different ratios of the equality operators in the 
expression is shown in Figure 8. 

As we have described in the previous section, recessive bits are irrel- 
evant for evaluation of the current solution, but they could be important 
for optimization process in the next generations. In this work we examine 
three different strategies for recessive bits generation. By the first strat- 
egy (RBR), recessive bits are generated randomly and can be changed by 
operation of mutation. In the next two strategies, all recessive bits have 
fixed value 1 (RB1) and 0 (RBO), respectively and can not be changed by 
mutation operation. The benefit of the processing cost for each strategy 
according to the different values of maintenance cost is shown in Figure 
9. Comparison of different strategies according to the solution space size 
is shown in Figure 10. In both cases the best strategy is RB1, ie. to fix 
recessive to value 1. This doesn’t mean that RB1 leads to the extreme solu- 
tion in which all additional attributes and all indexes are selected, because 
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the maintenance cost constraints in next generations eliminate infeasible 
solutions that would be otherwise obtained by RB1. 

To compare our novel algorithms (GGLA and GGBA) with SRGA, 
exactly 192 experiments were performed. For all experiments termination 
condition of optimization process was 64 generations. All algorithms were 
applied to the generalized solution space based of our system SFT. 

We experimented with different values of the parameter NF - Number 
of Fragments. For GGLA the number of fragments is equal to the number 
of steps of the greedy procedure, while for GGBA the number of steps of the 
greedy procedure is given by logg NF'+1. A comparison of the optimization 
process execution time of all three algorithms, for different number of frag- 
ments NF, is shown in Figure 11. The optimization process was considered 
with four different values of NF parameter, from 4 to 32. Improvements 
of GGLA and GGBA over SRGA are evident when we increase the values 
of the parameter. The value of the parameter is limited by the number of 
generations of the optimization process (in our case 64). However, usually 
the number of generations increases by the increasing of the complexity of 
the problem, i.e. its solution space size. 

A comparison of the different algorithms according to the solution space 
size is shown in Figure 12. For better representation of the performances 
of the algorithms on the same chart, we scaled values of the optimization 
process execution time. On the y-axis the benefit of optimization process 
execution time relatively to the worst algorithm is shown. Evidently, in all 
cases GGBA algorithm has the highest improvement of the optimization 
time. Note that improvements of both GGLA and GGBA algorithms rise 
by increasing the solution space size, which is another important feature: 
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the scalability of our algorithms and their appropriateness for practical im- 
plementation. 

In our algorithms we use a priori knowledge about usability of aggregate 
views and implement that knowledge as a heuristic in the greedy procedure 
for choosing views. Using greedy algorithm in the first step enables us 
to decrease the genetic material, i.e. the number of calculation in genetic 
part of GGLA algorithm and the possibility to parallelize some segments 
of genetic part in the GGBA algorithm. Finally, we note that values of 
given solutions by using GGLA and GGBA are in the range of 98%-101% 
of solutions given by using SRGA. 


5 Conclusion and Open Problems 


The performance of the system of RDW depends on several factors and the 
problem of its optimization is very complex and making a perfect system is 
still a challenge. In this paper we have focused on generalization of the op- 
timization problem and improvement of the optimization process of RDW. 
We fully analyzed the problem by including lot of factors relevant for opti- 
mization of the system, i.e. view selection, vertical view fragmentation and 
index selection. Further we have focused on improvement of the efficiency 
of the optimization process. 
By doing so we have achieved the following: 


e We introduced a generalized model of optimization of RDW, named 
SFI, which is comprise of selection of views (S), their vertical frag- 
mentation (F) and their indexing (1). 
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e By using a GA we have addressed and solved the problem of general- 
ized model of optimization of RDW. 


e We have introduced a new technique of “recessive bits” and we have 
analyzed the effects of those bits to the optimization process. 


e We have performed a large set of experiments and confirmed the ben- 
efits of the SFI model according to several parameters. 


e We have defined two novel algorithms GGLA and GGBA and we have 
successfully applied them for fast finding optimal solutions in the gen- 
eralized solution space. 


We have achieved significant performance improvements of the opti- 
mization process compared to the stochastic ranking genetic algorithm and 
we have verified those improvements by performing a large set of experi- 
ments. The result is an optimization algorithm that is much more effective 
and powerful than that developed in [23]. For comparison, the GA algo- 
rithm developed in [23] was successfully applied to a solution space that 
consists of 16 to 256 views, while our algorithm can be successfully applied 
to the solution space (views, fragmented views and indexes) of up to 1110 
objects. Furthermore, the system was tested for complex queries with pro- 
jection, selection, join and grouping operations and with complex selection 
predicates. 

In the future, we plan to extend our multidimensional model by in- 
cluding horizontal partitioning and definition of a clustering strategy and 
to define an optimization process by applying the algorithms to the extended 
space. There is also the possibility to analyze different strategies for han- 
dling the recessive bits and the possibility to parallelize the algorithms on 
different levels such as: evaluation of chromosome within populations, paral- 
lelization of populations or parallel optimization of different data cubes. In 
this paper static algorithms are considered. Our plan is to develop dynamic 
algorithms for RDW design optimization. 
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